The traditional storage approaches are being challenged by huge data volumes. In multimedia content,
files are not always exact duplicates; rather, they are prone to editing, which results in similar
copies of the same file. This paper proposes a similarity-based deduplication approach to evict similar
duplicates from archive storage by comparing samples of binary hashes. The query video is first divided
into a dynamic number of key frames based on the video length. Binary hash codes of these frames are
then compared with those of existing key frames to identify the differences. The similarity score is
determined from these differences and decides the eradication strategy for the duplicate copy.
Duplicate elimination goes through two levels, namely removal of exact duplicates and removal of
similar duplicates. The proposed approach shortens the comparison window by comparing only the
candidate hash codes selected from the dynamic key frames, and aims at accurate, lossless duplicate
removal. The presented work is executed and tested on a synthetic video dataset. Results show a
reduction in redundant data and an increase in available storage space. Binary hashes and similarity
scores contributed to achieving a good deduplication ratio and overall performance.

Keywords: Archive, Comparison window, Deduplication, Hash codes, Multimedia

1. INTRODUCTION

The multi-objective deduplication technique, which removes duplicate files, is gaining relevance and
applicability in the storage world. In view of the growth of digital data [1] and its storage
challenges, deduplication has progressed for textual data [2], [3], [4], [5], [6] and relational
data [7], [8]. Comparatively less research is observed for multimedia files. There is a huge multimedia
universe within the data universe, which is growing fast in popularity because of social media, online
courses, distance learning, presentations, self-study and the growth of smart devices, including mobile
phones and tablets. With the growth of internet usage and internet bandwidth, there has been a
significant increase in online video streaming and video downloading. Videos are becoming a powerful
e-learning trend. From an educational viewpoint, students, researchers, professors, and other
individuals often prefer watching videos to reading manuscripts or books. The majority of them prefer
downloading the videos, which results in a huge video collection and hence the need for data reduction
where there is redundancy. Widely used methods for reduction are compression and deduplication.
MPEG [9] addresses the compression of individual videos but not redundancy across similar videos.

Inline deduplication [10] limits the storage of a duplicate file. The search for duplicates occurs
before the file is stored, but this is a very time-consuming and resource-intensive process,
particularly with video files, for which CPU and memory resources are kept busy until video streaming
and the duplicate check are over. In this work, in contrast, we focus on archive storage and propose
post-process deduplication of videos, with dynamic splitting of the videos and a reduced set of
comparison candidate key frames based on the video length and the duplicate removal window.

The prime objective of deduplication is to reduce multiple redundant copies. Deduplication
performance depends on factors such as the comparison window, space savings, metadata lookup and, above
all, the accuracy of duplicate detection. The performance rises with more duplicates and more
similarity. Unlike text deduplication methods, the duplicate removal ratio for multimedia content is
improved by identifying similar duplicates. Video similarity can occur with respect to resolution,
formatting, frame rate and content difference. The existing literature focuses on similar features
among the videos. In [11], the author proposes a framework aiming at an efficient and fast duplicate
removal process. Its key factors are hash table indexing, video similarity and temporal locality of the
frames. The author creates multiple buckets, and each bucket stores the extracted frames sharing the
same hash code. Frames in a bucket are queried to find the similarity and remove the duplicates.

Near-duplicate removal based on shot similarity is proposed in [12] to identify duplicate scene
shots. It focuses on the same event captured with different cameras, angles, and temporal offsets. The
similarity between sets of trajectories is calculated to find the duplicates. The similarity is
measured for strict near-duplicates, object duplicates, and scene duplicates. The authors of [13], [14]
have proposed a secure video deduplication framework based on H.264 compression and a block-level codec
used for high-definition videos. The security aspects cover encryption, proof of storage, proof of
ownership and retrieval. However, shot and scene similarity and security aspects are not within the
scope of this paper.

Hashing based on multiple features for detecting near-duplicate videos is illustrated in [15].
Similarity and performance are evaluated by calculating Euclidean distance and mean average precision.
The research presented in [16] introduces an efficient yet effective global descriptor for
near-duplicate video copy detection in the presence of non-geometric distortions and other
transformations of video files. To process videos, they are converted into image frames; the authors
of [17] investigated Euclidean distance and shape-based similar image retrieval by means of Zernike
moments. The validity of digital images is verified by the authors of [18] by comparing the histograms
and principal component analysis of two images.

This paper focuses on removing duplicate and similar videos from backup or archive storage.
The archive is a collection of older, inactive data that is hardly ever accessed; it deals with
long-term data preservation. We intend to reduce the multimedia data size of the archives by performing
deduplication on the videos. The comparison window is shortened by effective sampling of the required
key frames based on the length of the video. In addition, after finding the similarity, the copy which
occupies less space is preserved in the storage. It is a two-level approach: the video is first checked
for exact duplicates, which results in the removal of the duplicate copy if an exact content match is
found. If no exact duplicate is found, the system searches for near-duplicates, which are removed by
similarity ranking of the frames based on their binary hashes.

The remaining sections of this article are structured as follows. The design and algorithms for
removing near-duplicate videos are described in Section 2, which also presents a model describing the
data growth and data removal rate in an archive. Performance analysis and results stating the storage
savings are presented in Section 3. The work is concluded in Section 4.


2. SYSTEM PROCEDURE 
2.1. Video Redundancy Elimination 

Deduplication partitions the videos into two opposed classes, {duplicates, non-duplicates}, based on
their fingerprint values. Videos are prone to editing, which results in their alteration and
filtration. With respect to generic video editing, the classification can be further refined into two
subclasses, {near-duplicates, no-duplicates}, based on the similarity percentage among the videos.
Consider the scenario in which there are V videos, n copies of each video (n_i for the i-th video), and
every video takes the storage space V_s. The total storage space overhead Total_space and the duplicate
degree Dup_degree can then be computed as given in Equations (1) and (2):

Total_{space} = \sum_{i=1}^{V} n_i V_s    (1)

Dup_{degree} = \frac{Total_{space}}{Space_{without\ duplicates}}    (2)
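As an illustration of Equations (1) and (2), the following minimal sketch computes the total space and the duplicate degree for a small hypothetical archive. It assumes that copies[i] holds the number of stored copies of the i-th distinct video and that every video occupies the same space; the function names and the sample numbers are illustrative and are not taken from the experiments.

```python
def total_space(copies, size_per_video):
    """Equation (1): sum over all videos of (number of copies * per-video size)."""
    return sum(n * size_per_video for n in copies)


def duplicate_degree(copies, size_per_video):
    """Equation (2): total space divided by the space needed with one copy per video."""
    space_without_duplicates = len(copies) * size_per_video
    return total_space(copies, size_per_video) / space_without_duplicates


# Hypothetical archive: 4 distinct videos stored 3, 1, 2 and 2 times, 100 MB each.
copies = [3, 1, 2, 2]
print(total_space(copies, 100))       # 800
print(duplicate_degree(copies, 100))  # 2.0
```

A duplicate degree of 1.0 would mean that the archive holds no redundant copies.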


The goal is to find these n duplicate copies so as to minimize the required storage space and the
duplicate degree. The system architecture for the application, which aims to control the media crowd in
the archive and lessen the number of copies n, is shown in Figure 1. The application implements the
duplicate video elimination process in two phases, namely a duplicate check based on the video
signatures and a check based on similar frames. These modules are described in Algorithm 1 and
Algorithm 2. Table 1 summarizes the notations and methods used in these algorithms.


Table 1. Glossary of Notations and Methods

Notations
Notation         Meaning
Q                Query video
H_a              Hash algorithm for video signature
Hash_MD          Metadata hash fingerprints
Dup_type         Type of duplicate (near/no)
HC_MD            Collection of hash codes in metadata
KF_QV            Query video key frames
HCQ_KF           Collection of hash codes in query video key frames
H_KF             Hash algorithm for key frames
CandHCQ_KF       Query video candidate hash codes
CandHC_MD        Metadata video candidate hash codes
Dist_xy          Distance between hash codes x and y
τ                Parameter for sampling candidate hash codes
Sim_score        List of similarity scores of candidate hash codes
finalSim_score   Final similarity ranking
ω                Decision parameter for near-duplicates
γ                Parameter for selecting candidate key frame

Methods
Method                  Meaning
videoDedup()            Find duplicate videos
getHashFingerprints()   Load fingerprints available in the metadata
fingerPrint()           Calculate hash signature of video
updateMetadata()        Pass the reference for the duplicate
similarityCheck()       Check similarity score between the videos
getFrameHashCodes()     Load hash codes available in the metadata
extractFrames()         Split query video into key frames
calculateHash()         Calculate hash code of key frames
selectCandidates()      Select comparison candidate key frames

[Figure: a query video first passes a signature duplicate check; a duplicate leads to (1) removal of
the copy and (2) an update of the reference, while a non-duplicate proceeds to the similarity check, in
which candidate frames of the video are compared against the database frames.]

Figure 1. System architecture


The algorithms for reducing storage space are presented below. Algorithm 1 describes the
high-level operational flow of the deduplication process. Its inputs are the query video and a hash
algorithm [19] that exhibits the avalanche effect. The fingerprint, or hash signature, of the query
video is matched against the existing fingerprints in the metadata. On an exact match, the duplicate
copy is removed, and the metadata is updated by passing the reference link for the duplicate copy. If
no match is found, the query video is passed to the method described in Algorithm 2 to find a similar
copy. Algorithm 2 describes the process for finding the similarity scores of the videos. The algorithm
loads the key frame hash codes available in the metadata corpus. The length of the metadata corpus is
given in Equation (3), where v is the number of videos and kf is the number of key frames.


|HC_{MD}| = \sum_{i=1}^{v} \sum_{j=1}^{kf} HC_{MD_{ij}}    (3)


The query video is divided into KF_QV key frames. The length of KF_QV is dynamic and varies with the
video length. Hash codes [20] are calculated for all these key frames. Given the length of the metadata
corpus |HC_MD|, the candidate hash codes CandHCQ_KF and CandHC_MD are selected from the query hash
codes and the metadata hash codes based on the parameter τ, as illustrated in Algorithm 3. The value of
τ is kept dynamic according to the value of KF_QV. For each pair of candidate hash codes, the distance
is calculated by comparing the codes as per Equation (4). A similarity score is calculated from these
distances, as shown in Equation (5), for each of the l candidate hash codes. After calculating the
individual similarity scores, the final similarity ranking of the query video is calculated by taking
the average of the individual scores. If this final score is greater than or equal to the stated
threshold ω, the query video is treated as a near-duplicate, as described in Equation (6). A final
score less than the threshold ω means the query video is not a duplicate and is considered a new video
file. This new video is stored in the storage, and its key frame details and their respective hash
codes are stored in the metadata.


Considering x = CandHCQ_KF and y = CandHC_MD, we get

Dist_{xy} = \sum_{i} |x_i - y_i|    (4)

Sim_{score} = \forall_{i=1}^{l} \; Dist^{CandHCQ_{KF_i}}_{CandHC_{MD_i}}    (5)
Q = \begin{cases} near\text{-}duplicate, & finalSim_{score} \geq \omega \\ no\text{-}duplicate, & otherwise \end{cases}    (6)
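The scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact scoring used in this work: hash codes are modelled as fixed-width Python integers, the distance is the bit-wise (Hamming) difference, and the per-frame score is normalised as 1 - distance / bits, which is an assumed normalisation. The names hamming_distance, similarity and classify are hypothetical.

```python
def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which the two binary hash codes differ."""
    return bin(x ^ y).count("1")


def similarity(x: int, y: int, bits: int = 64) -> float:
    """Assumed normalisation: 1.0 for identical codes, 0.0 when all bits differ."""
    return 1.0 - hamming_distance(x, y) / bits


def classify(query_codes, metadata_codes, omega=0.7, bits=64):
    """Average the per-candidate scores and apply the threshold of Equation (6)."""
    scores = [similarity(q, m, bits) for q, m in zip(query_codes, metadata_codes)]
    final_score = sum(scores) / len(scores)
    label = "near-duplicate" if final_score >= omega else "no-duplicate"
    return label, final_score


# Two 4-bit candidate pairs: one differs in a single bit, the other is identical.
print(classify([0b1111, 0b0000], [0b1110, 0b0000], omega=0.7, bits=4))
# ('near-duplicate', 0.875)
```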


Algorithm 1: Video deduplication based on hash signatures
Input: Query video Q_v; hash algorithm H_a
Output: Duplicate type (duplicate or no-duplicate)
Begin videoDedup
    Hash_MD[ ] ← getHashFingerprints()
    fp ← fingerPrint(H_a, Q_v)
    Search fp of the video in metadata Hash_MD[ ]
    if fp ∈ Hash_MD[ ] then
        Video is duplicate; remove the copy
        updateMetadata()
    else
        Dup_type = similarityCheck(Q_v)
        if Dup_type = No-duplicate then
            Video is unique; save the copy
        else // video is near-duplicate
            updateMetadata()
        end if
    end if
End

Algorithm 2: Video deduplication based on similarities
Input: Query video Q_v
Output: Duplicate type {near-duplicate, no-duplicate}
Begin similarityCheck
    HC_MD[ ][ ] ← getFrameHashCodes()
    KF_QV[ ] ← extractFrames(Q_v) // Dynamic
    for each (key frame in KF_QV[ ])
        HCQ_KF[ ] ← calculateHash(KF_QV, H_KF)
    End for
    // select candidate hash codes
    CandHCQ_KF[ ] ← selectCandidates(HCQ_KF[ ], τ)
    CandHC_MD[ ] ← selectCandidates(HC_MD[ ][ ], τ)
    for each (candidate in CandHCQ_KF[ ] and CandHC_MD[ ])
        Calculate Sim_score[ ]
    End for
    finalSim_score ← average(Sim_score[ ])
    if (finalSim_score ≥ ω) then
        Dup_type = “Near-duplicate”
    else
        Dup_type = “No-duplicate”
    End if
End
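The two-level flow of Algorithm 1 can be sketched in a few lines. The sketch below makes several assumptions that are not stated in this paper: SHA-256 stands in for the avalanche-effect hash of [19], the metadata is modelled as a dictionary mapping each fingerprint to a reference count, and the similarity check of Algorithm 2 is passed in as a callable. All names are hypothetical.

```python
import hashlib


def fingerprint(video_bytes: bytes) -> str:
    """Hash signature of the whole video (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(video_bytes).hexdigest()


def video_dedup(video_bytes, metadata, similarity_check):
    """Return 'duplicate', 'near-duplicate' or 'no-duplicate' for a query video.

    metadata maps fingerprint -> reference count (assumed layout).
    similarity_check is a callable returning 'near-duplicate' or 'no-duplicate'.
    """
    fp = fingerprint(video_bytes)
    if fp in metadata:
        # Level 1: exact duplicate, keep only a reference to the stored copy.
        metadata[fp] += 1
        return "duplicate"
    if similarity_check(video_bytes) == "near-duplicate":
        # Level 2: similar copy found, the query video is not stored again.
        return "near-duplicate"
    # Unique video: store it and register its fingerprint.
    metadata[fp] = 1
    return "no-duplicate"


metadata = {}
print(video_dedup(b"video-a", metadata, lambda v: "no-duplicate"))  # no-duplicate
print(video_dedup(b"video-a", metadata, lambda v: "no-duplicate"))  # duplicate
```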


Algorithm 3 illustrates the steps for selecting sample candidate key frames for the comparison. 


Int J Elec & Comp Eng, Vol. 8, No. 5, October 2018 : 3221 - 3231 


IJECE ISSN: 2088-8708


Algorithm 3: Candidate key frame selection
Input: HC[ ], collection of hash codes in metadata or query video, and τ deciding the candidates
Output: Cand_HC[ ]
Begin selectCandidates
    for each (hash code k in HC[ ])
        if (τ == 1) then
            Cand_HC[ ] ← HC[k]
        else
            Cand_HC[ ] ← HC[k + γ]
        end if
    End for
End
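One possible reading of Algorithm 3 is sketched below: when τ equals 1 every hash code is kept, otherwise the codes are sampled with a stride of γ. The stride interpretation of HC[k + γ], the function name and the parameter names are assumptions made for illustration.

```python
def select_candidates(hash_codes, tau, gamma):
    """Select comparison candidates from a list of key-frame hash codes.

    tau == 1 keeps every hash code; any other value samples every gamma-th
    code (assumed interpretation of the HC[k + gamma] step).
    """
    if tau == 1:
        return list(hash_codes)
    return list(hash_codes[::gamma])


codes = list(range(10))                 # stand-ins for ten key-frame hash codes
print(select_candidates(codes, 1, 3))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(select_candidates(codes, 2, 3))   # [0, 3, 6, 9]
```

Sampling with a larger stride shortens the comparison window at the cost of comparing fewer key frames.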


2.2. Probability Model for Similar Video Deletion

This section presents the probabilistic and deterministic models describing the fundamental
relationships between exact and near-duplicate videos, their storage space requirement, and the rate of
their growth and removal. The probabilistic experiment involves two random variables, namely the video
hash signature and the similarity score. Given a sample space Ω of V videos, where the probability of a
duplicate video is p and that of a non-duplicate is (1-p), the probability of getting exactly k
duplicate videos based on the hash signatures is as shown in Equation (7).


P(k\ Duplicates) = \binom{V}{k} p^{k} (1-p)^{V-k}    (7)
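Equation (7) is the binomial probability mass function, so it can be evaluated directly. The short sketch below uses hypothetical values for V, k and p; the function name is illustrative.

```python
from math import comb


def prob_k_duplicates(V: int, k: int, p: float) -> float:
    """Probability of exactly k duplicates among V videos (Equation (7))."""
    return comb(V, k) * p**k * (1 - p) ** (V - k)


# With two videos and p = 0.5, exactly one duplicate occurs half of the time.
print(prob_k_duplicates(2, 1, 0.5))  # 0.5
```

Summing prob_k_duplicates(V, k, p) over k = 0, ..., V gives 1, which is a quick sanity check on any chosen p.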


As videos are prone to editing, the removal of redundant videos as near-duplicates is further driven
by the similarity score Sim_score and a threshold value ω. The threshold ω is the cut-off on the
similarity score, and its value is less than 1. The probability that Sim_score takes values in the
interval from ω to 1 is obtained by integrating its probability density over that interval. The
probability and expected value of Sim_score over its possible values s are described in Equations (8)
and (9).


P(\omega \leq Sim_{score} \leq 1) = \int_{\omega}^{1} f_{Sim_{score}}(s)\, ds    (8)

E(Sim_{score}) = \int_{\omega}^{1} s\, f_{Sim_{score}}(s)\, ds    (9)
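As a worked example of Equations (8) and (9), assume purely for illustration that the similarity score is uniformly distributed on [0, 1] and that the threshold is ω = 0.7. The uniform density is an assumption and is not claimed to describe the dataset used in this work.

```latex
\begin{align*}
f_{Sim_{score}}(s) &= 1 \quad (0 \le s \le 1), \qquad \omega = 0.7 \\
P(\omega \le Sim_{score} \le 1) &= \int_{0.7}^{1} 1 \, ds = 1 - 0.7 = 0.3 \\
\int_{0.7}^{1} s \, f_{Sim_{score}}(s) \, ds &= \int_{0.7}^{1} s \, ds = \frac{1 - 0.7^{2}}{2} = \frac{0.51}{2} = 0.255
\end{align*}
```

Under this assumed density, 30% of the query videos would fall at or above the threshold.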


A discrete-time model, illustrated in Figure 2, describes the evolution of a video from its arrival
for backup to its storage. The model has six possible states and seven possible transitions between the
states, annotated with their respective transition probabilities. At any given time after its arrival,
a video can move to one of the states duplicate, similar or non-duplicate, based on the video
characteristics (fingerprint, similarity score). Once the video is identified as a duplicate or as
sufficiently similar, the system goes to the video reference state and stays in that state. The system
stays in the unique video state if there is no match on fingerprints and the videos are not similar.


Figure 2. Probabilistic discrete time model for video storage

After n transitions, the total probability λ_{s,t}(n) of the system starting at state s and ending at
state t, for all s and t, with k intermediate states, is given in Equation (10).

\lambda_{s,t}(n) = \sum_{i=1}^{k} \lambda_{s,i}(n-1)\, P_{i,t}    (10)


The total probability of the system starting in state 1, “video arrival”, and ending in state 5,
“Reference”, is illustrated below in Equations (11) to (13), where state 2 is “Duplicate” and state 3
is “Similar”.

\lambda_{1,5}(n) = \lambda_{1,2}(n-1) P_{2,5} + \lambda_{1,3}(n-1) P_{3,5}    (11)
\lambda_{1,5}(n) = 0.3 \times 1 + 0.3 \times 0.5    (12)
\lambda_{1,5}(n) = 0.45    (13)


The transition matrix for the model described in Figure 2 is given in Table 2 below. The probability
of going from “video arrival” to “duplicate” is 30%. The initial state of the system is [1,0,0,0,0,0],
i.e. the first state is “video arrival”. After one time unit, 30% of the videos have moved to the
“duplicate” state, 30% to the “similar” state, and 40% to the “non-duplicate” state.


Table 2. Probability Transition Matrix 


States* VA DUP SIM ND UNI REF 
VA 0 0.3 0.3 0.4 0 0 
DUP 0 0 0 0 0 1 
SIM 0 0 0 0.5 0 0.5 
ND 0 0 0 0 1 0 
UNI 0 0 0 0 1 0 
REF 0 0 0 0 0 1 


*States(VA-Video arrival, DUP-Duplicate, SIM-Similar, ND-non-duplicate, UNI-Unique video, REF-Video reference) 


The probability distribution shows that, after running the process for n videos (time units), the
system reaches the “unique video” state with probability p_u and the “reference video” state with
probability p_r. The simulation was done on n=100 videos, and the total probability mentioned in
Equation (10) is illustrated below, along with the probability distribution under the assumption that
the system already had 30 duplicate videos.

P^n is given by:
[0.00 0.00 0.00 0.00 0.55 0.45]
[0.00 0.00 0.00 0.00 0.00 1.00]
[0.00 0.00 0.00 0.00 0.50 0.50]
[0.00 0.00 0.00 0.00 1.00 0.00]
[0.00 0.00 0.00 0.00 1.00 0.00]
[0.00 0.00 0.00 0.00 0.00 1.00]

The probability distribution after running the process for 100 videos is [0.0, 0.0, 0.0, 0.0, 0.55, 0.75],
i.e. p_u = 0.55 and p_r = 0.75, i.e. (0.45 + 0.30).
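The first row of the matrix above can be reproduced by propagating the initial state through the transition matrix of Table 2. The sketch below uses plain Python lists and a hypothetical helper named step; it covers only the Markov chain itself and does not model the additional assumption of 30 pre-existing duplicate videos.

```python
# Transition matrix from Table 2; state order: VA, DUP, SIM, ND, UNI, REF.
P = [
    [0.0, 0.3, 0.3, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]


def step(dist, matrix):
    """One transition: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    size = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]


dist = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # every video starts in "video arrival"
for _ in range(100):
    dist = step(dist, P)

print([round(x, 2) for x in dist])  # [0.0, 0.0, 0.0, 0.0, 0.55, 0.45]
```

The chain settles after three transitions, since “unique video” and “video reference” are absorbing states.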
The deterministic model formulates the inputs and outputs over time, where the inputs are the
number of duplicate arrivals and the outputs are the number of duplicate removals. The rate of change
of storage space depends on the rate of video arrivals and the rate of video removals, as stated in
Equation (14).

R(storage_{space}) = R(video_{arrivals}) - R(video_{removals})    (14)

Considering S_0 as the initial value of the storage space, with the duplicate video arrival rate and
the duplicate video removal rate per file denoted α and β, the storage size S(t) at any time t can be
obtained by rewriting Equation (14) as given in Equation (15).

R(storage_{space}) = S'(t) = \alpha S(t) - \beta S(t)    (15)

\frac{dS}{dt} = \alpha S - \beta S    (16)

Letting r = α - β in Equation (16), we obtain

\frac{dS}{dt} = rS    (17)

After solving this differential equation, we get

S(t) = S_0 e^{rt}    (18)

According to Equation (18), the rate of change of storage space on the removal of duplicates
depends on the value of r, and it should be greater than zero. This paper aims to keep the value of r
(duplicate video arrivals - duplicate video removals) > 0 by removing more duplicate copies.
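For completeness, the step from Equation (17) to Equation (18) follows by separation of variables, using the initial condition S(0) = S_0. The intermediate constant C is introduced here only for the derivation.

```latex
\begin{align*}
\frac{dS}{dt} = rS
  \;&\Rightarrow\; \frac{dS}{S} = r \, dt
  \;\Rightarrow\; \ln S = rt + C
  \;\Rightarrow\; S(t) = e^{C} e^{rt} \\
S(0) = S_0 \;&\Rightarrow\; e^{C} = S_0
  \;\Rightarrow\; S(t) = S_0 e^{rt}
\end{align*}
```

The sign of r therefore decides whether S(t) grows (r > 0), stays constant (r = 0) or shrinks (r < 0) over time.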


3. RESULTS AND DISCUSSION 

In this section, we first compare the compression ratio of the videos obtained with the hashes
generated by the avalanche effect and with the binary hashes. Second, the performance in terms of
similarity score and precision-recall is observed. Experimentation was carried out on a combined
dataset that includes a few videos from the dataset [21]. A synthetic dataset was generated according
to the principles discussed in [22] from the videos in [23], YouTube videos and a personal video
collection. Dataset videos are either exact duplicates or approximately similar but different in
resolution, encoding format, content addition, length, etc. Videos used in the standard datasets are
too short. Unlike [21], the dataset used for our application contains videos of all durations, from
1 minute to more than 10 minutes.

We first adopt the avalanche-based hash method and compare it against the binary hashes. Figure 3
shows the results of three reduction methods (HE: hash-based exact duplicate removal, HS: hash-based
similar duplicate removal, SS: similarity-based similar duplicate removal) on two different datasets;
dataset1 is a video collection with exact duplicates, and dataset2 contains exact and similar
duplicates. HE is applied to dataset1, and the data size is observed to be reduced by 12%, 28%, and 34%
with compression (C), deduplication (DD) and compression plus deduplication (CDD), respectively. The HS
and SS techniques are applied to the common dataset2 with exact and similar duplicates. With HS, the
reduction percentages of C, DD, and CDD are observed as 5%, 19%, and 22%, because the hash-based
algorithm does not detect similarity features. A significant improvement in the reduction percentage is
seen with the SS approach, at 5%, 46%, and 50%, fulfilling the aim of data reduction and an increase in
the storage space.


[Figure: bar chart comparing the Original, Compression, Deduplication, and Compression and
Deduplication data sizes for the HE, HS and SS methods.]

Figure 3. Compression vs deduplication performance

Figure 4 plots the spread of the video similarity scores across the dataset. Whiskers show the range
of similarity scores for the entire dataset. The lowest score in this sample is 0 and the maximum is 1.
The median similarity score in the dataset is approximately 0.7. The first plot shows the similarity
scores of all the test data. The second plot shows the near-duplicates when the threshold ω is set to
0.7; the rectangular box ranges from 0.7 to 0.9, with whiskers extending up to a maximum of 1.0. The
last plot shows the no-duplicates, with scores below 0.7 and an outlier at score 0.0.


[Figure: box plots of similarity scores for Total.Scores, Near.Duplicates and No.Duplicates.]

Figure 4. Distribution of similarity scores (Box plots depicting median and quartile values of
similarity scores for entire dataset, near duplicates and no duplicates)

[Figure: box plots of similarity scores for fps1_5, fps10 and fps15.]

Figure 5. Comparison of similarity scores with different key frame sizes (Box and whisker plots
depicting the comparison of similarity scores for different key frame sizes)


Figure 5 compares the similarity scores for different key frame sizes. As per Algorithm 2, frame
extraction is kept dynamic according to the video length. From the figure, we conclude that when the
proposed system sets the fps (frames per second) in the range 1-5, we get the highest similarity
scores, with the box spanning 0.45 to 0.95, whiskers extending up to 1.0, an outlier towards 0, and a
median of 0.75 that is closer to the upper quartile. For fps 10, the similarity score varies from 0.55
to 0.9, with whiskers extending down to 0.0 and up to 1.0. In the last case, for fps 15, the
rectangular box ranges from 0.5 to 0.9. The respective medians indicated in the graph are 0.75, 0.7,
and 0.65.


[Figure: two panels, ROC (Sensitivity against 1 - Specificity) and Precision-Recall (Precision against
Recall).]

Figure 6. Performance of similar video removal process
(A precision-recall curve depicting the accuracy of similar video detection with ω = 0.7. Receiver
operating characteristic (ROC) curve showing the performance of the near-duplicate and no-duplicate
classification)

Figure 6 shows the accuracy and performance of the near-duplicate and no-duplicate classification
with the threshold ω set to 0.7. Precision is the fraction of videos identified as similar that are
actually similar, while recall is the fraction of actually similar videos that are identified.
Precision-recall curves are used to identify the confident zone of the overall performance, where a
larger area indicates better performance. The accuracy of the experimental results is measured by the
area under the ROC curve; an area towards 1.00 represents a perfect classification. As shown in
Figure 6, the area under the ROC curve is approximately 0.8-0.9, i.e. towards 1. Taking false positives
and false negatives into account, the F1 score, i.e. the harmonic mean of precision and recall, is
evaluated as 0.89.
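The metrics discussed above can be computed from a list of final similarity scores and ground-truth labels, as in the minimal sketch below. The scores and labels are hypothetical and serve only to show the calculation; they are not the experimental data behind the reported F1 score.

```python
def precision_recall_f1(scores, labels, omega=0.7):
    """Classify each score against the threshold and compare with the labels.

    labels[i] is True when video i is actually a near-duplicate.
    """
    predicted = [s >= omega for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


scores = [0.9, 0.8, 0.75, 0.6, 0.3]          # hypothetical final similarity scores
labels = [True, True, False, True, False]    # hypothetical ground truth
precision, recall, f1 = precision_recall_f1(scores, labels)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```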

Figure 7 shows three ROC curves for the threshold ω values 0.7, 0.75 and 0.8, respectively. The best
accuracy is given by the red curve (ω=0.7), whose area under the ROC curve is observed to be towards 1.
The PR curve is considered more informative in finalizing the threshold value that gives the
similarities as predicted. The larger curve area is again observed for the red curve with ω=0.7. Hence,
the similarity score threshold for detecting similar videos is set to 0.7. Videos with
Sim_score >= 0.7 are considered duplicates and are deleted from the corpus.


[Figure: two panels, ROC (Sensitivity against 1 - Specificity) and Precision-Recall (Precision against
Recall), with one curve per threshold 0.70, 0.75 and 0.80.]

Figure 7. Performance using different similarity score threshold

Figure 8. Rate of change of storage space


As per the deterministic model discussed in Section 2, the rate of change of storage space depends
on the rate of duplicate video arrivals and the rate of duplicate or similar video removals. The
redundancy count is inversely related to the storage space. Figure 8 shows this rate of change and
illustrates that as the video file redundancy count decreases, i.e. as duplicate or similar video files
are removed, there is an increase in the storage space.

4. CONCLUSION

This paper presents an efficient duplicate video removal process based on video similarities.
The reported work shows that deduplication with similarity scores performs well compared to
deduplication with hash signatures on datasets that contain more similar content. The original data
size was significantly reduced, with a 50% deduplication ratio. The accuracy of similar video detection
is observed through the precision-recall curve, with a larger curve area, and an ROC curve area towards
1. The accuracy of video classification into similar duplicates and no-duplicates is measured with the
F1 score (harmonic mean of precision and recall) of 89%.